The distillation method: A novel approach for analyzing randomized trials when exposure to the intervention is diluted

Abstract
Objective: To introduce a novel analytical approach for randomized controlled trials that are underpowered because of low participant enrollment or engagement.
Data Sources: Reanalysis of data for 805 patients randomized as part of a pilot complex care intervention in 2015–2016 in a large delivery system. In the pilot randomized trial, only 64.6% of patients assigned to the intervention group participated.
Study Design: A case study and simulation. The “Distillation Method” capitalizes on the frequently observed correlation between the probability of subjects' participation or engagement in the intervention and the magnitude of benefit they experience. The novel method involves three stages: first, it uses baseline covariates to generate predicted probabilities of participation. Next, these are used to produce nested subsamples of the randomized intervention and control groups that are more concentrated with subjects who were likely to participate/engage. Finally, for the outcomes of interest, standard statistical methods are used to re‐evaluate intervention effectiveness in these concentrated subsets.
Data Extraction Methods: We assembled secondary data on patients who were randomized to the pilot intervention for one year prior to randomization and two follow‐up years. Data included program enrollment status, membership data, demographics, utilization, costs, and clinical data.
Principal Findings: Using baseline covariates only, Generalized Boosted Regression Models predicting program enrollment performed well (AUC 0.884). We then distilled the full randomized sample to increasing levels of concentration and reanalyzed program outcomes. We found statistically significant differences in outpatient utilization and emergency department utilization (both follow‐up years), and in total costs (follow‐up year two only) at select levels of population concentration.
Conclusions: By offering an internally valid analytic framework, the Distillation Method can increase the power to detect effects by redefining the estimand to subpopulations with higher enrollment probabilities and stronger average treatment effects while maintaining the original randomization.

estimator that maximizes statistical power. The estimand is still an ATE, but it is an ATE for a subset of the population in the original RCT. The estimator is still a typical treatment effect estimator, but it is applied to a subset of the participants in the RCT. Conceptually, the method is a very ordinary estimation approach applied to an empirically defined population that is a subset of the original RCT population.
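The idea of applying an ordinary estimator to an empirically defined subset can be illustrated with a minimal sketch. It assumes that predicted uptake probabilities (here called `p_hat`, a hypothetical name) have already been produced from baseline covariates, that the subset is formed by ranking the pooled sample on `p_hat`, and that the second-stage estimator is a simple difference in means. The toy records are invented for illustration, and the authors' implementation may differ on all of these points.

```python
from statistics import mean


def distilled_effect(records, keep_fraction):
    """Treatment-minus-control difference in mean outcome within the
    keep_fraction of subjects with the highest predicted uptake.

    records: list of (arm, p_hat, outcome) tuples with arm in {"T", "C"}.
    p_hat depends on baseline covariates only, so the same cut is applied
    to both arms and the comparison stays a randomized one.
    """
    ranked = sorted(records, key=lambda r: r[1], reverse=True)
    n_keep = max(2, round(keep_fraction * len(ranked)))
    subset = ranked[:n_keep]
    treated = [y for arm, _, y in subset if arm == "T"]
    control = [y for arm, _, y in subset if arm == "C"]
    if not treated or not control:
        raise ValueError("distilled subset must contain both arms")
    return mean(treated) - mean(control)


# Invented toy data: subjects with high p_hat show the larger arm difference.
records = [
    ("T", 0.9, 1.0), ("C", 0.9, 3.0),
    ("T", 0.8, 2.0), ("C", 0.7, 4.0),
    ("T", 0.3, 4.0), ("C", 0.3, 4.0),
    ("T", 0.2, 5.0), ("C", 0.1, 5.0),
]

itt_estimate = distilled_effect(records, 1.0)        # full sample
distilled_estimate = distilled_effect(records, 0.5)  # top half by p_hat
```

Setting `keep_fraction` to 1.0 reproduces the full-sample comparison, so the nested subsets can be scanned by lowering that single parameter.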
The additional assumptions the Distillation Method requires are that there is a relationship between treatment uptake and the magnitude of the treatment effect and that there is at least one pretreatment predictor of treatment uptake. Although we motivate the plausible existence of such cases in this manuscript by discussing the motivations of subjects and recruiters, the method requires only these properties; it does not depend on the relationships being generated by this posited mechanism.
As implemented here, failing to satisfy these assumptions is harmless: the method defaults to the original ITT estimator. In the simulations presented below, the estimator in such cases simply returns the original ITT results, and the method may be abandoned. Note also that the method does not appear to increase the type I error rate when there is no treatment effect to be found. Under the assumptions used to generate the simulation data, there does not appear to be any danger of using the method to capitalize on chance or to engage in "p-hacking."
Note that if there is no correlation between the probability of uptake and the treatment effect, or if there are no predictors of uptake, the distillation estimate and the original ITT estimate share the same estimand. This is the situation in which a Hausman specification test¹ is straightforward. Regardless of the treatment effect estimator employed, the difference between the ITT and distillation estimates is a test statistic. Because these estimates will in general be correlated, a standard error for the difference would generally require a bootstrap calculation. Bootstrap standard errors also free the analyst to select the most appropriate statistical model for the outcomes: the approach can be applied whether the outcome model is a cost model, a logistic regression, or any other model of interest. In practice, a practically meaningful difference between the estimates, together with a small enough standard error for the distillation estimate, should suffice.
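A bootstrap standard error for the difference between the distillation and ITT estimates might be computed along the following lines. This is only a sketch under stated assumptions: subjects are resampled within arm, the effect estimator is a difference in means, and the predicted uptake probability `p_hat` is treated as fixed (a fuller analysis would presumably refit the first-stage uptake model in every replicate). The function names and the toy data are hypothetical.

```python
import random
from statistics import StatisticsError, mean, stdev


def diff_in_means(records):
    """Treatment-minus-control difference in mean outcome."""
    treated = [y for arm, _, y in records if arm == "T"]
    control = [y for arm, _, y in records if arm == "C"]
    return mean(treated) - mean(control)


def distill(records, keep_fraction):
    """Keep the keep_fraction of records with the highest p_hat."""
    ranked = sorted(records, key=lambda r: r[1], reverse=True)
    return ranked[:max(2, round(keep_fraction * len(ranked)))]


def bootstrap_difference(records, keep_fraction, n_boot=200, seed=0):
    """Point estimate and bootstrap SE of (distilled estimate - ITT estimate)."""
    rng = random.Random(seed)
    arms = {a: [r for r in records if r[0] == a] for a in ("T", "C")}
    point = diff_in_means(distill(records, keep_fraction)) - diff_in_means(records)
    draws = []
    for _ in range(n_boot):
        # Resample within arm so every replicate keeps the original arm sizes.
        sample = [rng.choice(arms[a]) for a in ("T", "C") for _ in arms[a]]
        try:
            draws.append(diff_in_means(distill(sample, keep_fraction))
                         - diff_in_means(sample))
        except StatisticsError:
            continue  # skip a replicate whose distilled subset lost an arm
    return point, stdev(draws)


# Invented toy data: the effect grows with p_hat in the treatment arm only.
data_rng = random.Random(1)
toy = []
for arm in ("T", "C"):
    for _ in range(200):
        p_hat = data_rng.random()
        effect = -1.0 * p_hat if arm == "T" else 0.0
        toy.append((arm, p_hat, effect + data_rng.gauss(0.0, 1.0)))

point, se = bootstrap_difference(toy, keep_fraction=0.5)
z_statistic = point / se  # difference relative to its bootstrap SE
```

Because both estimates are recomputed on the same resample, their correlation is carried into the spread of the differences without having to be modeled explicitly.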

Simulation design
The above discussion suggests that the simulation must, at a minimum, have three underlying variables and a mechanism to correlate them. The three quantities required are (1) a treatment effect, (2) an uptake effect, and (3) a predictor of uptake. For simulation purposes, the correlation between the variables will be induced by generating them from a trivariate normal distribution. The framework is similar in form to the probit-style mechanisms used to study limited dependent variables; see, for example, Maddala². The simulation study below is designed as a $6 \times 4^3 \times 11^2$ factorial design, with 6 levels of treatment effect, 4 levels for each of the three pairwise correlations between the variables in the generating trivariate normal, and 11 values each of the uptake rate and the refractory fraction. Several other potentially variable parameters (e.g., sample size) are set to single values that correspond to the features of the example. These values are discussed below.
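The size of the factorial design can be checked by enumerating its cells. In the sketch below, only the uptake-rate grid (0 to 1.0 by 0.1) is taken from the text; the treatment-effect levels, the correlation levels, and the refractory-fraction grid are placeholder values chosen purely for illustration.

```python
from itertools import product

# Placeholder levels (assumptions, not the values used in the study).
treatment_effects = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]   # 6 levels
correlations = [0.0, 0.25, 0.5, 0.75]                # 4 levels per pair
# Uptake rates from 0 to 1.0 by 0.1, as stated in the text.
uptake_rates = [i / 10 for i in range(11)]           # 11 values
# Assumed to use the same grid as the uptake rate.
refractory_fractions = [i / 10 for i in range(11)]   # 11 values

design = [
    {
        "treatment_effect": te,
        "corr_effect_uptake": r_eu,
        "corr_effect_predictor": r_ep,
        "corr_uptake_predictor": r_up,
        "uptake_rate": ur,
        "refractory_fraction": rf,
    }
    for te, r_eu, r_ep, r_up, ur, rf in product(
        treatment_effects, correlations, correlations, correlations,
        uptake_rates, refractory_fractions,
    )
]

n_cells = len(design)  # 6 * 4**3 * 11**2 = 46464
```

Each dictionary is one cell of the design and can be passed to whatever data-generating routine is used for that cell.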
Researchers applying the Distillation Method may wish to run their own customized version of these simulations. This would be particularly useful for more complicated second stage models like two-part cost models. The simulation code is available from the authors.

Treatment effect:
A person-level (heterogeneous) treatment effect and outcome: We start with a "per protocol" person-level heterogeneous treatment effect. This is the benefit, in expectation, that would occur if the subject took the treatment.

Uptake effect:
The subjects in the treatment group only receive the treatment effect if they take the treatment.
Their actual treatment effect is the person-level treatment effect if they take up the treatment and zero otherwise. This is the hypothetical treatment effect. The observed outcome adds a normal error term with variance $\sigma^2 = 1.0$. This is meant to mimic a log-normal cost analysis. If the interest were to model a binary outcome, a probit generating process could be substituted here. Even more elaborate generating processes are possible.
A person-level uptake effect: In these simulations we used values of UptakeRate from 0 to 1.0 in steps of 0.1. The generation process is defined below.
A predictor of uptake: The predictor is a normal random variable. It represents the predictions of the first-stage uptake model. Regardless of the functional form of the model, it is the correlations between the model's predictions and the uptake and treatment effects that produce the distillation effect.

The joint distribution generating mechanism:
The joint distribution of the treatment effect, the uptake effect, and the predictor of uptake is trivariate normal. The data sets have sample sizes of 400 in both treatment and control. This value was selected to match the example.
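A generating mechanism of this general kind could be sketched as follows. Everything beyond "three correlated normal variables, 400 subjects per arm, and a unit-variance normal error" is an assumption made for illustration: the variables are standardized, the person-level effect is a mean plus the first normal variable, uptake is obtained by thresholding the second variable at the quantile matching the uptake rate, and the refractory fraction is left out because its definition is not given in this section. The names `cholesky3` and `simulate_arm` are hypothetical.

```python
import math
import random
from statistics import NormalDist


def cholesky3(r_eu, r_ep, r_up):
    """Cholesky entries for a 3x3 correlation matrix ordered
    (effect, uptake, predictor); raises if it is not positive definite."""
    l22_sq = 1.0 - r_eu ** 2
    if l22_sq <= 0.0:
        raise ValueError("correlations do not give a positive-definite matrix")
    l22 = math.sqrt(l22_sq)
    l32 = (r_up - r_eu * r_ep) / l22
    l33_sq = 1.0 - r_ep ** 2 - l32 ** 2
    if l33_sq <= 0.0:
        raise ValueError("correlations do not give a positive-definite matrix")
    return r_eu, l22, r_ep, l32, math.sqrt(l33_sq)


def simulate_arm(n, treated, mean_effect, uptake_rate, corr, rng):
    """Simulate one arm; corr = (r_effect_uptake, r_effect_predictor,
    r_uptake_predictor)."""
    l21, l22, l31, l32, l33 = cholesky3(*corr)
    if uptake_rate <= 0.0:
        cut = math.inf            # nobody takes up
    elif uptake_rate >= 1.0:
        cut = -math.inf           # everybody takes up
    else:
        cut = NormalDist().inv_cdf(1.0 - uptake_rate)
    rows = []
    for _ in range(n):
        e1, e2, e3 = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
        z_effect = e1
        z_uptake = l21 * e1 + l22 * e2
        predictor = l31 * e1 + l32 * e2 + l33 * e3
        # In the control arm this is latent: would the subject have taken up?
        takes_up = z_uptake > cut
        tau = mean_effect + z_effect          # person-level effect (assumed form)
        received = tau if (treated and takes_up) else 0.0
        outcome = received + rng.gauss(0, 1)  # normal error with variance 1.0
        rows.append({"treated": treated, "takes_up": takes_up,
                     "predictor": predictor, "received": received,
                     "outcome": outcome})
    return rows


rng = random.Random(2024)
corr = (0.5, 0.5, 0.5)
treatment = simulate_arm(400, True, 0.5, 0.6, corr, rng)
control = simulate_arm(400, False, 0.5, 0.6, corr, rng)
```

The positive-definiteness check matters for a factorial grid over three pairwise correlations, since some combinations of levels cannot be realized by any trivariate normal distribution.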